Abstract

For DNA sequences of various species we construct the Google matrix G of Markov tran- 
sitions between nearby words composed of several letters. The statistical distribution of 
matrix elements of this matrix is shown to be described by a power law with the exponent 
being close to those of outgoing links in such scale-free networks as the World Wide Web 
(WWW). At the same time the sum of ingoing matrix elements is characterized by the 
exponent being significantly larger than those typical for WWW networks. This results 
in a slow algebraic decay of the PageRank probability determined by the distribution of 
ingoing elements. The spectrum of G is characterized by a large gap leading to a rapid 
relaxation process on the DNA sequence networks. We introduce the PageRank proximity 
correlator between different species which determines their statistical similarity from the 
view point of Markov chains. The properties of other eigenstates of the Google matrix 
are also discussed. Our results establish scale-free features of DNA sequence networks 
showing their similarities and distinctions with the WWW and linguistic networks.

Introduction

The theory of Markov chains [1] finds impressive modern applications to information
retrieval and ranking of directed networks including the World Wide Web (WWW), where
the number of nodes is now counted in tens of billions. The PageRank algorithm (PRA) [2]
uses the concept of the Google matrix $G$ and makes it possible to rank all WWW nodes in
an efficient way. This algorithm is a fundamental element of the Google search engine used
by a majority of Internet users. A detailed description of this method and basic properties
of the Google matrix can be found e.g. in [3, 4].

The Google matrix belongs to the class of Perron-Frobenius operators naturally appearing
in dynamical systems (see e.g. [5]). Using the Ulam method [6], a discrete approximant of
the Perron-Frobenius operator can be constructed for simple dynamical maps by following
only one trajectory in a chaotic component [7] or by using many independent trajectories
and counting their transition probabilities between phase space cells [8, 9, 10]. Studies
of the Google matrix of such directed Ulam networks provide an interesting and detailed
analysis of dynamical properties of maps with complex chaotic dynamics [7, 8, 9, 10].



In this work we use the Google matrix approach to study the statistical properties of
DNA sequences of the species: Homo sapiens (HS, human), Canis familiaris (CF, dog),
Loxodonta africana (LA, elephant), Bos taurus (BT, bull), Danio rerio (DR, zebrafish),
taken from the publicly available database [11]. The analysis of Poincaré recurrences
in these DNA sequences [12] shows their similarities with the statistical properties of
recurrences for dynamical trajectories in the Chirikov standard map and other symplectic
maps [7]. Indeed, a DNA sequence can be viewed as a long symbolic trajectory and hence
the Google matrix constructed from it highlights the statistical features of DNA from a
new viewpoint.



An important step in the statistical analysis of DNA sequences was done in [13] by applying
methods of statistical linguistics and determining the frequency of various words
composed of up to 7 letters. First-order Markovian models were also proposed and
briefly discussed in that work. Here we show that the Google matrix analysis provides a
natural extension of this approach. Thus the PageRank eigenvector gives the frequency of
appearance of words of given length. The spectrum and eigenstates of $G$ characterize the
relaxation processes of different modes in the Markov process generated by a symbolic
DNA sequence. We show that the comparison of word ranks of different species makes it
possible to identify proximity between species.

At present the investigations of statistical properties of DNA sequences are actively
developed by various bioinformatic groups (see e.g. [14, 15, 16, 17, 18]). The development
of various methods of statistical analysis of DNA sequences is now becoming of great
importance due to the rapid growth of collected genomic data. We hope that the Google matrix
approach, which has already demonstrated its efficiency for enormously large networks [2, 3],
will find useful applications for the analysis of genomic data sets.

Results 

Construction of Google matrix from DNA sequence 



From [11] we collected DNA sequences of HS represented as a single string of length
$L \approx 1.5 \cdot 10^{10}$ base pairs (bp) corresponding to 5 individuals. Similar data
are obtained for BT ($2.9 \cdot 10^{9}$ bp), CF ($2.5 \cdot 10^{9}$ bp), LA ($3.1 \cdot 10^{9}$ bp),
DR ($1.4 \cdot 10^{9}$ bp). For HS, CF, LA, DR the statistical properties of Poincaré
recurrences in these sequences are analyzed in [12]. All strings are composed of 4 letters
A, C, G, T and an undetermined letter $N_l$. The strings can be found at the web page [19].

For a given sequence we fix the words $W_k$ of $m$ letters length corresponding to the
number of states $N = 4^m$. We consider that there is a transition from a state $j$ to a
state $i$ inside this basis of $N$ states when we move along the string from left to right
going from a word $W_k$ to the next word $W_{k+1}$. This transition adds one unit to the
transition matrix element, $T_{ij} \to T_{ij} + 1$. The words with letter $N_l$ are omitted;
the transitions are counted only between nearby words not separated by words with $N_l$.
There are approximately $N_t \approx L/m$ such transitions for the whole length $L$ since the
fraction of undetermined letters $N_l$ is small. Thus we have $N_t = \sum_{i,j=1}^{N} T_{ij}$.
The Markov matrix of transitions $S_{ij}$ is obtained by normalizing matrix elements in such
a way that their sum in each column is equal to unity: $S_{ij} = T_{ij} / \sum_{i} T_{ij}$.
If there are columns with all zero elements (dangling nodes) then zeros of such columns are
replaced by $1/N$. Such a procedure corresponds to the one used for the construction of the
Google matrix of the WWW [2, 3]. Then the Google matrix of a DNA sequence is written as

$$G_{ij} = \alpha S_{ij} + (1 - \alpha)/N, \qquad (1)$$

where $\alpha$ is the damping factor for which the Google search usually uses the value
$\alpha \approx 0.85$ [3]. The matrix $G$ belongs to the class of Perron-Frobenius operators.
It has the largest eigenvalue $\lambda = \lambda_1 = 1$ with all other eigenvalues
$|\lambda_i| \le \alpha$. For WWW usually there are isolated subspaces so that at $\alpha = 1$
there are many degenerate $\lambda = 1$ eigenvalues [4], so that the damping factor makes it
possible to eliminate this degeneracy, creating a gap between $\lambda = 1$ and all other
eigenvalues. For our DNA Google matrices we find that there is already a significant
spectral gap naturally present. In this case the PageRank vector is not sensitive to the
damping factor in the range $0.5 < \alpha < 1$ (other eigenvectors are independent of
$\alpha$ [3, 4, 9]). For this reason in the following we present all results at the
value $\alpha = 1$.
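
The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used to obtain the results: the function name `build_google_matrix`, the choice to cut the string into consecutive non-overlapping words starting at the first letter, and the dense list-of-lists storage are assumptions made for the example.

```python
from itertools import product

LETTERS = "ACGT"

def build_google_matrix(seq, m, alpha=1.0):
    """Return (words, G); G[i][j] is the weight of the transition word j -> word i."""
    words = ["".join(p) for p in product(LETTERS, repeat=m)]
    index = {w: k for k, w in enumerate(words)}
    n = len(words)  # n = 4**m states
    T = [[0] * n for _ in range(n)]
    prev = None
    # Assumption: consecutive non-overlapping words of m letters, read left to right.
    for start in range(0, len(seq) - m + 1, m):
        cur = index.get(seq[start:start + m])  # None if the word has an undetermined letter
        if prev is not None and cur is not None:
            T[cur][prev] += 1  # one more transition prev -> cur, stored in column prev
        prev = cur  # a word with an undetermined letter breaks the chain
    G = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(T[i][j] for i in range(n))
        for i in range(n):
            # Column normalization; dangling columns are replaced by 1/n.
            s = T[i][j] / col_sum if col_sum else 1.0 / n
            G[i][j] = alpha * s + (1.0 - alpha) / n
    return words, G

# Toy usage with 1-letter words; "N" stands for an undetermined letter.
words, G = build_google_matrix("ACNGT", 1)
```

In the toy string only the transitions A to C and G to T are counted, since the other neighbouring pairs involve the undetermined letter; the columns of C and T are then dangling and are filled with 1/4.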

The spectrum $\lambda_i$ and right eigenstates $\psi_i(j)$ are determined by the equation

$$\sum_{j'} G_{jj'} \psi_i(j') = \lambda_i \psi_i(j). \qquad (2)$$

The PageRank eigenvector $P(j)$ at $\lambda = 1$ has positive or zero elements which can be
interpreted as the probability to find a random surfer on a given site $j$, with the total
probability normalized to unity, $\sum_j P(j) = 1$. Thus, all sites can be ordered in
decreasing order of probability $P(j)$, which gives us the PageRank order index $K(j)$ with
the most frequent sites at low values of $K = 1, 2, \ldots$.
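
As an illustration of how the PageRank vector and the index $K(j)$ follow from a column-normalized matrix, the sketch below uses plain power iteration. Power iteration is used here only as a simple stand-in for a full eigenvalue solver, and the names `pagerank` and `rank_index` as well as the 2x2 toy matrix are hypothetical.

```python
def pagerank(G, tol=1e-12, max_iter=10000):
    """Stationary vector of a column-stochastic matrix G, found by power iteration."""
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        q = [sum(G[i][j] * p[j] for j in range(n)) for i in range(n)]
        total = sum(q)
        q = [x / total for x in q]  # keep the total probability equal to 1
        if sum(abs(a - b) for a, b in zip(p, q)) < tol:
            return q
        p = q
    return p

def rank_index(p):
    """K(j): 1 for the most probable node, 2 for the next one, and so on."""
    order = sorted(range(len(p)), key=lambda j: -p[j])
    K = [0] * len(p)
    for rank, j in enumerate(order, start=1):
        K[j] = rank
    return K

# Toy column-stochastic matrix; its stationary vector is (5/6, 1/6).
G = [[0.9, 0.5],
     [0.1, 0.5]]
P = pagerank(G)
K = rank_index(P)
```

The iteration converges quickly whenever there is a gap between the leading eigenvalue and the rest of the spectrum.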

It is useful to consider the density of matrix elements $G_{KK'}$ in the PageRank indexes
$K, K'$, similar to the presentation used in [20, 21] for networks of Wikipedia, UK
universities, Linux Kernel and Twitter. The image of the DNA Google matrix of HS is shown in
Fig. 1 for words of 5 and 6 letters. We see that almost the whole matrix is full, which is
drastically different from the WWW and other networks considered in [20], where the matrix
$G$ is very sparse. Thus the DNA Google matrix is more similar to the case of Twitter, which
is characterized by a strong connectivity of top PageRank nodes [21].



It is interesting to analyze the statistical properties of matrix elements $G_{ij}$. Their
integrated distribution is shown in Fig. 2. Here $N_g$ is the number of matrix elements of
the matrix $G$ with values $G_{ij} > g$. The data show that the number of nonzero matrix
elements $G_{ij}$ is very close to $N^2$. The main fraction of elements has values
$G_{ij} < 1/N$ (some elements $G_{ij} < 1/N$ since for certain $j$ there are many transitions
to some node $i'$ with $T_{i'j} \gg N$ and e.g. only one transition to another $i''$ with
$T_{i''j} = 1$). At the same time there are also transition elements $G_{ij}$ with large
values whose fraction decays according to an algebraic law $N_g \approx A N / g^{\nu - 1}$
with some constant $A$ and an exponent $\nu$. The fit of numerical data in the range
$-5.5 < \log_{10} g < -0.5$ of algebraic decay gives for $m = 6$:
$\nu = 2.46 \pm 0.025$ (BT), $2.57 \pm 0.025$ (CF), $2.67 \pm 0.022$ (LA), $2.48 \pm 0.024$ (HS),
$2.22 \pm 0.04$ (DR). For the HS case we find $\nu = 2.68 \pm 0.038$ at $m = 5$ and
$\nu = 2.43 \pm 0.02$ at $m = 7$, with the average $A \approx 0.003$ for $m = 5, 6, 7$. There
are visible oscillations in the algebraic decay of $N_g$ with $g$, but overall we see that
on average all species are well described by a universal decay law with the exponent
$\nu \approx 2.5$. For comparison we also show the distribution $N_g$ for the WWW networks of
the Universities of Cambridge and Oxford in the year 2006 (data from [4, 20]). In these
networks we have $N \approx 2 \cdot 10^{5}$ and on average 10 links per node. We see that in
these cases the distribution $N_g$ has a very short range in which the decay is at least
approximately algebraic ($-5.5 < \log_{10}(N_g/N^2) < -6$). In contrast to that, for the DNA
sequences we have a large range of algebraic decay.
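
The integrated distribution and the exponent fit described above can be illustrated with a short Python sketch. It assumes an ordinary least-squares fit in log-log scale, which may differ from the fitting procedure actually used; the function names and the synthetic data points are hypothetical.

```python
import math

def integrated_counts(values, thresholds):
    """N_g for each threshold g: the number of values strictly larger than g."""
    return [sum(1 for v in values if v > g) for g in thresholds]

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) versus log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

# Synthetic check: data following N_g ~ 1/g**(nu - 1) with nu = 2.5.
g_values = [1e-5, 1e-4, 1e-3, 1e-2]
n_g = [g ** -1.5 for g in g_values]
nu = 1.0 - loglog_slope(g_values, n_g)  # slope = -(nu - 1)
```

Since the fitted law is $N_g \propto 1/g^{\nu-1}$, the exponent is recovered from the log-log slope as one minus the slope.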

Since in each column the sum of all elements is equal to unity, we can say that the
differential fraction $dN_g/dg \propto 1/g^{\nu}$ gives the distribution of outgoing matrix
elements, which is similar to the distribution of outgoing links extensively studied for the
WWW networks [3, 23, 24, 25]. Indeed, for the WWW networks all links in a column
are considered to have the same weight, so that these matrix elements are given by the
inverse number of outgoing links [3]. Usually the distribution of outgoing links follows
a power law decay with an exponent $\tilde{\nu} \approx 2.7$, even if it is known that this
exponent fluctuates much more than in the case of ingoing links. Thus we establish that
the distribution of DNA matrix elements is similar to the distribution of outgoing links
in the WWW networks, with $\nu \approx \tilde{\nu}$. We note that for the distribution of
outgoing links of the Cambridge and Oxford networks the fit of numerical data gives the
exponents $\tilde{\nu} = 2.80 \pm 0.06$ (Cambridge) and $2.51 \pm 0.04$ (Oxford).

It is known that on average the probability of the PageRank vector is proportional to the
number of ingoing links [3]. This relation is established for scale-free networks with an
algebraic distribution of links when the average number of links per node is about 10 to
100, which is usually the case for WWW, Twitter and Wikipedia networks
[4, 20, 21, 22, 23, 24, 25]. Thus in such a case the matrix $G$ is very sparse. For DNA we
find the opposite situation, where the Google matrix is almost full and zero matrix elements
are practically absent. In such a case an analogue of the number of ingoing links is the sum
of ingoing matrix elements $g_s = \sum_{j=1}^{N} G_{ij}$. The integrated distribution of
ingoing matrix elements with the dependence of $N_s$ on $g_s$ is shown in Fig. 3. Here $N_s$
is defined as the number of nodes with the sum of ingoing matrix elements being larger than
$g_s$. A significant part of this dependence, corresponding to large values of $g_s$ and
determining the PageRank probability decay, is well described by a power law
$N_s \approx B N / g_s^{\mu - 1}$. The fit of data at $m = 6$ gives $\mu = 5.59 \pm 0.15$ (BT),
$4.90 \pm 0.08$ (CF), $5.37 \pm 0.07$ (LA), $5.11 \pm 0.12$ (HS), $4.04 \pm 0.06$ (DR). For the
HS case at $m = 5, 7$ we find respectively $\mu = 5.86 \pm 0.14$ and $4.48 \pm 0.08$. For HS
and other species we have an average $B \approx 1$.
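
For a full matrix the quantities $g_s$ and $N_s$ are simply row sums and a threshold count. A minimal sketch, with hypothetical function names and a toy 3x3 column-normalized matrix chosen only for illustration:

```python
def ingoing_sums(G):
    """g_s for every node i: the sum of row i of the column-normalized matrix G."""
    return [sum(row) for row in G]

def integrated_node_count(sums, g_s):
    """N_s: the number of nodes whose ingoing sum is strictly larger than g_s."""
    return sum(1 for s in sums if s > g_s)

# Toy column-stochastic matrix (each column sums to 1); the values are hypothetical.
G = [
    [0.6, 0.5, 0.7],
    [0.3, 0.25, 0.2],
    [0.1, 0.25, 0.1],
]
sums = ingoing_sums(G)
```

Because every column sums to unity, the ingoing sums add up to $N$, so their mean value is 1; nodes with $g_s$ well above 1 are the ones that dominate the top of the PageRank.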

Usually for the ingoing links distribution of WWW and other networks one finds the
exponent $\tilde{\mu} \approx 2.1$ [23, 24, 25]. This value of $\tilde{\mu}$ is expected to be
the same as the exponent for ingoing matrix elements of the matrix $G$. Indeed, for the
ingoing matrix elements of the Cambridge and Oxford networks we find respectively the
exponents $\mu = 2.12 \pm 0.03$ and $2.06 \pm 0.02$ (see curves in Fig. 3). For the ingoing
links distribution of the Cambridge and Oxford networks we obtain respectively
$\tilde{\mu} = 2.29 \pm 0.02$ and $\tilde{\mu} = 2.27 \pm 0.02$, which are close to the
usual WWW value $\tilde{\mu} \approx 2.1$. Thus we can say that for the WWW type networks we
have $\mu \approx \tilde{\mu}$. In contrast, the exponent $\mu$ for DNA Google matrix elements
takes a significantly larger value $\mu \approx 5$. This feature marks a significant
difference between DNA and WWW networks.

For DNA we see that there is a certain curvature in addition to a linear decay in log-log
scale. On the one hand, all species are close to a unique universal decay curve which
describes the distribution of ingoing matrix elements $g_s$ (there is a more pronounced
deviation for DR, which does not belong to the mammalian species). On the other hand, we see
visible differences between the distributions of various species (e.g. the non-mammalian DR
case has the largest deviation from the other mammalian species). We will discuss the links
between $\mu$ and the exponent of the PageRank algebraic decay $P(K) \propto 1/K^{\beta}$ in
the next sections.

Spectrum of DNA Google matrix 



The spectrum of eigenvalues of the DNA Google matrix $G$ of HS is shown in Fig. 4 for
words of $m = 5, 6, 7$ letters and matrix sizes $N = 4^m$. The spectra for DNA sequences of
bull BT, dog CF, elephant LA and zebrafish DR are shown in Fig. 5 for words of $m = 6$
letters. The spectra and eigenstates are obtained by direct numerical diagonalization of
the matrix $G$ using standard LAPACK code.

In all cases the spectrum has a large gap which separates the eigenvalue $\lambda = 1$ from
all other eigenvalues with $|\lambda| < 0.5$ (only for the non-mammalian DR case do we have a
small group of eigenvalues within $0.5 < |\lambda| < 0.75$). This is drastically different
from the spectrum of WWW and other types of networks, which usually have no gap in the
vicinity of $\lambda = 1$ (see e.g. [4, 21, 22]). In a certain sense the DNA $G$ spectrum is
similar to the spectrum of randomized WWW networks and the spectrum of $G$ of the
Albert-Barabási network model discussed in [26], but the properties of the PageRank vector
are rather different, as we will see below.

Visually the spectrum is mostly similar between HS and CF, having approximately the
same radius of the circular cloud $|\lambda| < \lambda_c \approx 0.2$. For DR this radius is
the smallest, with $\lambda_c \approx 0.1$. Thus the spectrum of $G$ indicates the difference
between mammalian and non-mammalian sequences. For HS the increase of the word length
$m = 5; 6; 7$ leads to an increase of $\lambda_c \approx 0.1; 0.2; 0.35$. For $m = 7$ the
number of nonzero matrix elements $G_{ij}$ is close to $N^2$ and thus on average we have only
about $L/(mN^2) \approx 8$ transitions per element. This determines an approximate limit of
reliable statistical computation of matrix elements $G_{ij}$ for the available HS sequence
length $L$. For HS at $m = 6$ we verified that two halves of the whole sequence $L$ still
give practically the same spectrum, with a relative accuracy of
$\Delta\lambda/\lambda \approx 0.01$ for eigenvalues in the main part of the cloud at
$\lambda_c/3 < |\lambda| < \lambda_c$. This means that the spectrum presented in Figs. 4, 5
is statistically stable at the values of $L$ used in this work.
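
The estimate of the number of transitions per matrix element can be checked by direct arithmetic with the values quoted in the text ($L \approx 1.5 \cdot 10^{10}$, $N = 4^m$). The $m = 6$ line is added here only for comparison and is not a number quoted in the text.

```latex
% m = 7: N = 4^7 = 16384, N^2 = 268435456
\frac{L}{m N^2} \approx \frac{1.5 \cdot 10^{10}}{7 \cdot 2.68 \cdot 10^{8}}
               \approx \frac{1.5 \cdot 10^{10}}{1.88 \cdot 10^{9}} \approx 8 ,
% m = 6: N = 4^6 = 4096, N^2 = 16777216
\qquad
\frac{L}{m N^2} \approx \frac{1.5 \cdot 10^{10}}{6 \cdot 1.68 \cdot 10^{7}}
               \approx \frac{1.5 \cdot 10^{10}}{1.01 \cdot 10^{8}} \approx 149 .
```

Going from $m = 6$ to $m = 7$ thus reduces the average statistics per element by a factor of about 19, which is consistent with $m = 7$ being close to the limit of reliable computation.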

We also constructed the Google matrix $G^*$ by inverting the direction of transitions
$T_{ij} \to T_{ji}$ and then normalizing the sum of all elements in each column to unity. This
procedure is also equivalent to moving along the sequence, from word to word, not from
left to right but from right to left. We note that for WWW and other networks such a
matrix with inverted direction of links was used to obtain the CheiRank vector (which is
the PageRank vector of the matrix $G^*$). Due to the inversion of links the CheiRank vector
highlights very communicative nodes [4, 20, 21, 22]. In our case the spectra of $G$ and
$G^*$ are identical. As a result the probability distributions of the PageRank and CheiRank
vectors are the same. This is due to a kind of detailed balance principle: we count only
transitions between nearby words in a DNA sequence and the direction of displacement
along the sequence does not affect the average transition probabilities, so that
$T_{ij} = T_{ji}$ (up to statistical fluctuations). In a certain sense this situation is
similar to the case of Ulam networks in symplectic maps, where the conservation of phase
space area leads to the same properties of $G$ and $G^*$ [7, 10].
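
A minimal sketch of the construction of $G^*$ from a matrix of transition counts, at damping factor equal to 1. The function names and the toy count matrices are hypothetical; the last lines illustrate that a symmetric count matrix gives the same matrix for both directions.

```python
def column_normalize(T):
    """Column-normalize a matrix of counts; all-zero columns are replaced by 1/n."""
    n = len(T)
    S = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(T[i][j] for i in range(n))
        for i in range(n):
            S[i][j] = T[i][j] / col_sum if col_sum else 1.0 / n
    return S

def reversed_google_matrix(T):
    """G* at damping factor 1: invert the transitions (T_ij -> T_ji), then normalize."""
    n = len(T)
    T_rev = [[T[j][i] for j in range(n)] for i in range(n)]
    return column_normalize(T_rev)

# Asymmetric toy counts: G and G* differ.
T = [[0, 2],
     [1, 1]]
G = column_normalize(T)
G_star = reversed_google_matrix(T)

# Symmetric toy counts (T_ij = T_ji): G and G* coincide.
T_sym = [[1, 3],
         [3, 2]]
same = column_normalize(T_sym) == reversed_google_matrix(T_sym)
```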



We tried to test whether a random matrix model can reproduce the distribution of eigenvalues
in the $\lambda$ plane. With this aim we generated random matrix elements $G_{ij}$ with
exactly the same distribution $N_g$ as for the HS case at $m = 6$ (see Fig. 2). However, in
this random model we found all eigenvalues homogeneously distributed within the radius
$\lambda_c \approx 0.07$, which is significantly smaller than for the real data. Also in this
case the PageRank probability $P(K)$ changes only by 30% in the whole range $1 < K < N$,
being absolutely different from the real data (see next section). Thus the construction of
random matrix models which are able to produce results similar to the real data remains a
task for future investigations.

PageRank properties of various species 



By numerical diagonalization of the Google matrix we determine the PageRank vector
$P(K)$ at $\lambda = 1$ and several other eigenvectors with maximal values of $|\lambda|$.
The dependence of probability $P$ on index $K$ is shown in Fig. 6 for various species and
different word lengths $m$. The probability $P(K)$ describes the steady state of random walks
on the Markov chain and thus it gives the frequency of appearance of various words of length
$m$ in the whole sequence $L$. The frequencies or probabilities of word appearance in the
sequences have been obtained in [13] by direct counting of words along the sequence (the
available sequences $L$ were shorter at that time). Both methods are mathematically
equivalent and indeed our distributions $P(K)$ are in good agreement with those found in
[13], even if now we have significantly better statistics.

The decay of $P$ with $K$ can be approximately described by a power law $P \sim 1/K^{\beta}$.
Thus for example for the HS sequence at $m = 7$ we find $\beta = 0.357 \pm 0.003$ for the fit
range $1.5 < \log_{10} K < 3.7$, which is rather close to the exponent found in [13]. Since
on average the PageRank probability is proportional to the number of ingoing links, or the
sum of ingoing matrix elements of $G$, one has the relation between the exponent of PageRank
$\beta$ and the exponent of ingoing links (or matrix elements): $\beta = 1/(\mu - 1)$
[3, 4, 23, 24, 25]. Indeed, for the HS DNA case at $m = 7$ we have $\mu = 4.48$, which gives
$\beta = 0.29$, close to the above value of $\beta = 0.357$ obtained from the direct fit of
the $P(K)$ dependence. We think that the agreement is not perfect since there is a visible
curvature in the log-log plot of $N_s$ vs $g_s$ in Fig. 3. Also, due to the small value of
$\beta$ the variation range of $P$ is not so large, which reduces the accuracy of the
numerical fit even if the formal statistical error is relatively small compared to the
visible systematic nonlinear variations. In spite of this only approximate agreement, we
should say that overall the relation between $\beta$ and $\mu$ works correctly. On average we
find for the DNA network the value $\mu \approx 5$, significantly larger than for the WWW
networks with $\tilde{\mu} \approx 2.1$ [3]. This gives a significantly smaller value
$\beta \approx 0.25$ for the DNA case compared to the usual WWW value $\beta \approx 0.9$ (we
note that the randomized WWW networks and the Albert-Barabási model have $\beta \approx 1$
[26]). The relation between $\beta$ and $\mu$ also works for the DR DNA case at $m = 6$ with
$\mu = 4.04$, which gives $\beta = 0.33$, in satisfactory agreement with the fit value
$\beta = 0.426$ found from the $P(K)$ dependence of Fig. 6.
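
The relation between the two exponents follows from a short scaling argument, sketched below under two simplifying assumptions: that $P$ is exactly proportional to the sum of ingoing elements $g_s$, and that the rank $K$ of a node equals the number of nodes with a larger $g_s$.

```latex
N_s \propto g_s^{-(\mu - 1)}, \qquad P \propto g_s, \qquad K \approx N_s
\;\Longrightarrow\;
K \propto P^{-(\mu - 1)}
\;\Longrightarrow\;
P(K) \propto K^{-1/(\mu - 1)} = K^{-\beta}, \qquad \beta = \frac{1}{\mu - 1} .
% Values quoted in the text:
% \mu = 4.48 gives \beta = 1/3.48 \approx 0.29 ; \quad \mu = 4.04 gives \beta = 1/3.04 \approx 0.33 .
```

Any curvature of $N_s$ versus $g_s$ in log-log scale breaks the first proportionality, which is one way to see why the fitted $\beta$ can deviate from $1/(\mu - 1)$.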

At $m = 6$ we find for our species the following values of the exponent:
$\beta = 0.273 \pm 0.005$ (BT), $0.340 \pm 0.005$ (CF), $0.281 \pm 0.005$ (LA),
$0.308 \pm 0.005$ (HS), $0.426 \pm 0.008$ (DR) in the range $1 < \log_{10} K < 3.3$. There is
a relatively small variation of $\beta$ between various mammalian species. The data of Fig. 6
for HS show that the value of $\beta$ remains stable with the increase of word length. These
observations are similar to those made in [13].

PageRank proximity between species

The top ten 6-letter words, with the largest probabilities $P(K)$, are given for all studied
species in Table 1. The two top words are identical for BT, CF, HS. To see a similarity
between species on a global scale it is convenient to plot the PageRank index $K_s(i)$ of a
given species $s$ versus the index $K_{hs}(i)$ of HS for the same word $i$. For identical
sequences one should have all points on the diagonal, while the deviations from the diagonal
characterize the differences between species. Examples of such PageRank proximity $K$-$K$
diagrams are shown in Figs. 7, 8 for words at $m = 6$. A zoom of the data on a small scale in
the range $1 < K < 200$ is shown in Fig. 9. The visual impression is that the CF case has
smaller deviations from the HS rank compared to BT and LA. The non-mammalian DR case has the
strongest deviations from the HS rank. For the BT, CF and LA cases we have a significant
reduction of deviations from the diagonal around $K \approx 3N/4$. This effect is also
visible for the DR case, even if less pronounced. We do not have an explanation for this
observation.

The fraction of purine letters A or G in a word of $m = 6$ letters is shown by color
in Fig. 7 for all words ranked by PageRank index $K$. We see that these letters are
approximately homogeneously distributed over the whole range of $K$ values. In contrast
to that, the distribution of letters A or T is inhomogeneous in $K$: their fraction is
dominant for $1 < K < N/4$, approximately homogeneous for $N/4 < K < 3N/4$ and close
to zero for $3N/4 < K < N$ (see Fig. 8). We find that in the whole HS sequence the
fractions $F_a, F_c, F_g, F_t$ of A, C, G, T are respectively 0.276596, 0.192576, 0.192624,
0.276892 (and $F_n = 0.061312$ for undetermined $N_l$). Thus we have the fraction of A, G
being close to $1/2 \approx (F_a + F_g)/(1 - F_n) = 0.499867$ and the fraction of A, T being
$(F_a + F_t)/(1 - F_n) = 0.589640 > 0.5$. Thus it is more probable to have A or T in the
whole sequence, which can be a possible origin of the inhomogeneous distribution of A or T
along $K$ and of the large fraction of A, T at top PageRank positions.
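
The two ratios quoted above can be reproduced directly from the listed letter fractions. The short sketch below only repeats that arithmetic; the helper name `fraction_among_determined` is hypothetical.

```python
# Letter fractions of the whole HS sequence as quoted in the text.
F = {"A": 0.276596, "C": 0.192576, "G": 0.192624, "T": 0.276892, "N": 0.061312}

def fraction_among_determined(letters, F):
    """Share of the given letters among all determined (non-N) letters."""
    return sum(F[x] for x in letters) / (1.0 - F["N"])

purine_fraction = fraction_among_determined("AG", F)  # close to 1/2
at_fraction = fraction_among_determined("AT", F)      # noticeably above 1/2
```

The five fractions add up to 1, and the complementary C, G share is one minus the A, T share, so the excess of A, T over 1/2 is mirrored by a deficit of C, G.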

The whole HS sequence used here is composed from 5 humans with individual length
$L_i \approx 3 \cdot 10^{9} \approx L/5$. We consider the first and last fifth parts of the
whole sequence $L$ separately, thus forming two independent sequences HS1 and HS2 of two
individuals. We determine for them the corresponding PageRank indexes $K_{hs1}$ and
$K_{hs2}$ and show their PageRank proximity diagram in Fig. 10. In this case the points are
much closer to the diagonal compared to the case of comparison of HS with other species.

To characterize the proximity between different species or different HS individuals we
compute the average dispersion
$\sigma(s_1, s_2) = \sqrt{\sum_{i=1}^{N} (K_{s_1}(i) - K_{s_2}(i))^2 / N}$ between two
species (individuals) $s_1$ and $s_2$. Comparing the words with length $m = 5, 6, 7$ we find
that the scaling $\sigma \propto N$ works with good accuracy (about 10% when $N$ is increased
by a factor of 16). To represent the result in a form independent of $m$ we compare the
values of $\sigma$ with the corresponding random model value $\sigma_{rnd}$. This value is
computed assuming a random distribution of $N$ points in a square $N \times N$ when only one
point appears in each column and each line (e.g. at $m = 6$ we have
$\sigma_{rnd} \approx 1673$ and $\sigma_{rnd} \propto N$). The dimensionless dispersion
is then given by $\zeta(s_1, s_2) = \sigma(s_1, s_2)/\sigma_{rnd}$. From the ranking of
different species we obtain the following values at $m = 6$: $\zeta$(CF, BT) = 0.308;
$\zeta$(LA, BT) = 0.324, $\zeta$(LA, CF) = 0.303; $\zeta$(HS, BT) = 0.246,
$\zeta$(HS, CF) = 0.206, $\zeta$(HS, LA) = 0.238; $\zeta$(DR, BT) = 0.425,
$\zeta$(DR, CF) = 0.414, $\zeta$(DR, LA) = 0.422, $\zeta$(DR, HS) = 0.375 (other $m$ have
similar values). According to this statistical analysis of PageRank proximity between
species we find that the $\zeta$ value is minimal between CF and HS, showing that these are
the two most similar species among those considered here.
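
A minimal sketch of the dispersion and of the dimensionless correlator for two rank lists. One point is an assumption rather than something stated in the text: the random model value is taken from the closed-form mean square rank difference of a uniformly random permutation, $\sigma_{rnd}^2 = (N^2 - 1)/6$, instead of a numerical simulation. Function names are hypothetical.

```python
import math

def rank_dispersion(K1, K2):
    """sigma: root mean square difference between two rank lists of equal length."""
    n = len(K1)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(K1, K2)) / n)

def random_dispersion(n):
    """Assumed closed form for a random permutation: sqrt((n**2 - 1) / 6)."""
    return math.sqrt((n * n - 1) / 6.0)

def proximity_correlator(K1, K2):
    """zeta: dispersion between two rankings in units of the random model value."""
    return rank_dispersion(K1, K2) / random_dispersion(len(K1))

sigma_rnd_m6 = random_dispersion(4 ** 6)           # about 1672 for N = 4096
zeta_same = proximity_correlator([1, 2, 3], [1, 2, 3])      # identical rankings
zeta_reversed = proximity_correlator([1, 2, 3], [3, 2, 1])  # fully reversed ranking
```

With this closed form the $m = 6$ value is about 1672, close to the value 1673 quoted above, and it grows proportionally to $N$ for large $N$. Identical rankings give zero, while a fully reversed ranking gives a value above 1.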

For two HS individuals we find $\zeta$(HS1, HS2) = 0.031, which is significantly smaller than
the proximity correlator between different species. We think that this PageRank proximity
correlator $\zeta$ can be useful as a quantitative measure of statistical proximity between
various species.

Finally, in Table 2 we give for all species the words of 6 letters with the 10 minimal
PageRank probabilities. Thus for HS the least probable is the word TACGCG, corresponding
to the two amino acids Tyr and Ala. In general the ten last words are mainly composed
of C and G, even if the letters A and T still have a small but nonzero weight. The last two
words are the same for the mammalian species but they are different for the DR sequence.

Other eigenvectors of G 



The properties of the 10 eigenstates $\psi_i(j)$ of the DNA Google matrix with the largest
modulus of eigenvalues $|\lambda_i|$ are analyzed in Table 3 and Fig. 11. The words $W_i$ at
the maximal amplitude $|\psi_i(j)|$ are presented for all species in Table 3. We see that in
general these words $W_i$ are rather different from the top PageRank word $W_1$ (some words
appear in pairs since there are pairs of complex conjugated values
$\lambda_i = \lambda_{i'}^*$).

The probability of the above top 10 eigenstates as a function of the PageRank index $K$ is
shown in Fig. 11. We see that the majority of the vectors, different from the PageRank
vector, have well localized peaks at relatively large values $K > 50$. This shows that in
the DNA network there are some modes located on certain specific patterns of words.

To illustrate the localized structure of eigenmodes $\psi_i(j)$ for the HS case at $m = 6$ we
compute the inverse participation ratio
$\xi_i = (\sum_j |\psi_i(j)|^2)^2 / \sum_j |\psi_i(j)|^4$, which gives an
approximate number of nodes on which the main probability of an eigenstate $\psi_i(j)$ is
located (see e.g. [4, 21, 26]). The obtained values are $\xi_i$ = 385.26, 16.37, 2.07, 1.72,
2.23, 3.19, 77.43, 77.43, 2.33, 2.06 for $i = 1, \ldots, 10$ respectively. We see that for
$i > 1$ we have significantly smaller $\xi$ values compared to the case of the PageRank
vector with a large $\xi_1$. This supports the conclusion about the localized structure of a
large fraction of eigenvectors of $G$.
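
The inverse participation ratio is straightforward to evaluate for any eigenvector. A minimal sketch with a hypothetical function name and toy vectors:

```python
def inverse_participation_ratio(psi):
    """xi = (sum |psi_j|^2)^2 / sum |psi_j|^4 for a real or complex vector psi."""
    s2 = sum(abs(x) ** 2 for x in psi)
    s4 = sum(abs(x) ** 4 for x in psi)
    return s2 * s2 / s4

xi_uniform = inverse_participation_ratio([0.5, 0.5, 0.5, 0.5])  # spread over 4 nodes
xi_single = inverse_participation_ratio([0.0, 1j, 0.0])         # located on 1 node
```

A vector spread uniformly over $n$ nodes gives a value of $n$ and a vector located on a single node gives 1, independently of the overall normalization, which is why the quantity can be read as an effective number of occupied nodes.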

In [22], on the example of the Wikipedia network, it is shown that the eigenstates with
relatively large $|\lambda|$ select specific communities of the network. The detection of
communities in complex networks is now an active research direction [27]. We expect that
the eigenmodes of the $G$ matrix can select specific words of bioinformatic interest.
However, a detailed analysis of words from eigenmodes remains for further investigations.



Discussion 

In this work we used long DNA sequences of various species to construct from them
the Markov process describing the probabilistic transitions between words of up to 7
letters length. We construct the Google matrix of such transitions with the size up to
$4^7$ and analyze the statistical properties of its matrix elements. We show that for all 5
species studied in this work the matrix elements of significant amplitude have a power
law distribution with the exponent $\nu \approx 2.5$, close to the exponent of the outgoing
links distribution typical for WWW and other complex directed networks with
$\tilde{\nu} \approx 2.7$. The distribution of significant values of the sum of ingoing matrix
elements of $G$ is also described by a power law with the exponent $\mu \approx 5$, which is
significantly larger than the corresponding exponent for WWW networks with
$\tilde{\mu} \approx 2.1$. We show that, similar to the WWW networks, the exponent $\mu$
determines the exponent $\beta = 1/(\mu - 1) \approx 0.25$ of the algebraic PageRank decay,
which is significantly smaller than its value for WWW networks with $\beta \approx 0.9$. The
PageRank decay is similar to the frequency decay of various words studied previously in
[13]. It is interesting to note that the value $\mu - 1$ is close to the exponent of the
Poincaré recurrences decay, which has a value close to 4 [12] (even if we cannot derive a
direct mathematical relation between them).

Using PageRank vectors of various species we introduce the PageRank proximity correlator
$\zeta$, which makes it possible to measure in a quantitative way the proximity between
different species. This parameter remains stable with respect to variation of the word length.

The spectrum of the Google matrix is determined and it is shown that it is characterized
by a significant gap between $\lambda = 1$ and other eigenvalues. Thus, this spectrum is
qualitatively different from the WWW case where the gap is absent at the damping factor
$\alpha = 1$. We show that the eigenmodes with the largest values of $|\lambda| < 1$ are well
localized on specific words and we argue that the words corresponding to such localized
modes can play an interesting role in bioinformatic properties of DNA sequences.

Finally we would like to trace parallels between the Google matrix analysis of words
in DNA sequences and the small world properties of human language. Indeed, it is known
that the frequency of words in natural languages follows a power law Zipf distribution with
the exponent $\beta \approx 1$ [28]. The parallels between word distributions in DNA
sequences and statistical linguistics were already pointed out in [13]. The degree
distributions of undirected networks of words in natural languages were found to follow a
power law with an exponent $\nu_l \approx 1.5 - 2.7$ [29], not so far from the one found here
for the matrix elements distribution. It is argued that language evolution plays an
important role in the formation of such a distribution in languages [30]. The parallels
between linguistics and DNA sequence complexity are actively discussed in bioinformatics
[31, 32]. We think that the Google matrix analysis can provide new insights in the
construction and characterization of information flows on DNA sequence networks, extending
recent steps done in [33].



In summary, our results show that the distributions of significant matrix elements 
are similar to those of the scale-free type networks like WWW, Wikipedia and linguistic 
networks. In analogy with lingusitic networks it can be useful to go from words network 
analysis to a more advanced functional level of links inside sentences that may be viewed 
as a network of links between amino acids or more complex biological constructions. 
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Figure 1. DNA Google matrix of Homo sapiens (HS) constructed for words of 5-letters 
(top) and 6-letters (bottom) length. Matrix elements Gkk 1 are shown in the basis of 
PageRank index K (and K'). Here, x and y axes show K and K' within the range 
1 < K, K' < 200 (left) and 1 < K, K' < 1000 (right). The element G n at K = K' = 1 is 
placed at top left corner. Color marks the amplitude of matrix elements changing from 
blue for minimum zero value to red at maximum value. 

Figure 2. Integrated fraction $N_g/N^2$ of Google matrix elements with $G_{ij} > g$ as a 
function of g. Left panel: Various species with 6-letter word length: bull BT 
(magenta), dog CF (red), elephant LA (green), Homo sapiens HS (blue) and zebrafish 
DR (black). Right panel: Data for the HS sequence with words of length m = 5 (brown), 6 
(blue), 7 (red). For comparison, black dashed and dotted curves show the same 
distribution for the WWW networks of the Universities of Cambridge and Oxford in 2006, 
respectively. 
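
As an illustration of the quantity plotted in Figure 2, the following minimal sketch counts the fraction of matrix elements exceeding a threshold g. The function name `integrated_fraction` and the 2 x 2 matrix `G_toy` are hypothetical and chosen only for this example; they are not taken from the DNA data.

```python
def integrated_fraction(G, g):
    """Return N_g / N^2: the fraction of elements of the N x N matrix G with G[i][j] > g."""
    n = len(G)
    n_g = sum(1 for row in G for value in row if value > g)
    return n_g / (n * n)


# Hypothetical toy matrix (each column sums to 1); not derived from any genome.
G_toy = [[0.9, 0.5],
         [0.1, 0.5]]

for g in (0.0, 0.4, 0.95):
    print(g, integrated_fraction(G_toy, g))
```

Evaluating this function on a logarithmic grid of g values would give a curve of the kind shown in the figure, under the assumption that G is stored as a dense list of rows.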

Figure 3. Integrated fraction $N_s/N$ of the sum of ingoing matrix elements with 
$\sum_{j=1}^{N} G_{ij} \ge g_s$. Left and right panels show the same cases as in Fig. 2, 
in the same colors. The dashed and dotted curves are shifted along the x-axis by one unit 
to the left to fit the figure scale. 
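
The integrated fraction of summed ingoing elements can be sketched in the same spirit. The code below assumes the convention that the ingoing elements of node i form row i of G, so that their sum is a row sum; the names `ingoing_sum_fraction` and `G_toy` are hypothetical.

```python
def ingoing_sum_fraction(G, g_s):
    """Return N_s / N: the fraction of nodes i whose row sum over j of G[i][j] is >= g_s."""
    n = len(G)
    row_sums = [sum(row) for row in G]  # assumed convention: row i holds links entering node i
    n_s = sum(1 for s in row_sums if s >= g_s)
    return n_s / n


# Hypothetical toy matrix (each column sums to 1); not derived from any genome.
G_toy = [[0.9, 0.5],
         [0.1, 0.5]]

for g_s in (0.5, 1.0, 2.0):
    print(g_s, ingoing_sum_fraction(G_toy, g_s))
```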

Figure 4. Spectrum of eigenvalues in the complex plane $\lambda$ for the DNA Google matrix of 
Homo sapiens (HS) shown for words of 5, 6, 7 letters (from top to bottom). 

Figure 5. Spectrum of eigenvalues in the complex plane $\lambda$ for the DNA Google matrix of 
bull BT, dog CF, elephant LA, zebrafish DR shown for words of 6 letters (from top to 
bottom). 

Figure 6. Dependence of PageRank probability P(K) on PageRank index K. Left 
panel: Data for different species for a word length of 6 letters: bull BT (magenta), dog 
CF (red), elephant LA (green), Homo sapiens HS (blue) and zebrafish DR (black). Right 
panel: Data for HS (full curve) and LA (dashed curve) for word length m = 5 (brown), 
6 (blue/green), 7 (red). 

Figure 7. PageRank proximity K-K plane diagrams for different species in 
comparison with Homo sapiens: the x-axis shows the PageRank index $K_{hs}(i)$ of a word i and 
the y-axis shows the PageRank index of the same word i with $K_{bt}(i)$ of bull, $K_{cf}(i)$ of dog, 
$K_{la}(i)$ of elephant and $K_{dr}(i)$ of zebrafish; here the word length is m = 6. The color of 
the symbols marks the purine content in a word i (fraction of letters A or G in any order); 
the color varies from red at maximal content, via brown, yellow, green, light blue, to 
blue at minimal zero content. 
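
The points of such a K-K diagram can be assembled as in the following minimal sketch, which ranks words by PageRank probability in two species and attaches the purine fraction used for coloring. The probability values are hypothetical placeholders, and the helper names (`purine_fraction`, `rank_index`, `proximity_points`) are introduced only for this example.

```python
def purine_fraction(word):
    """Fraction of letters in `word` that are purines (A or G)."""
    return sum(1 for letter in word if letter in "AG") / len(word)


def rank_index(probabilities):
    """Map each word to its PageRank index K (K = 1 for the largest probability)."""
    ordered = sorted(probabilities, key=probabilities.get, reverse=True)
    return {word: k for k, word in enumerate(ordered, start=1)}


def proximity_points(p_ref, p_other):
    """Return (K_ref, K_other, purine fraction) for every word present in both species."""
    k_ref, k_other = rank_index(p_ref), rank_index(p_other)
    return [(k_ref[w], k_other[w], purine_fraction(w)) for w in p_ref if w in k_other]


# Hypothetical PageRank probabilities for three words; not measured values.
p_hs = {"AAAAAA": 0.5, "ATATAT": 0.3, "CGCGTA": 0.2}
p_dr = {"AAAAAA": 0.3, "ATATAT": 0.6, "CGCGTA": 0.1}

for point in proximity_points(p_hs, p_dr):
    print(point)
```

Replacing the set "AG" by "AT" in `purine_fraction` would give the A or T content instead.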

Figure 8. Same as in Fig. 7 but now the color marks the fraction of letters A or T in 
any order in a word i, with red at maximal content and blue at zero content. 

Figure 9. Zoom of the PageRank proximity K-K diagram of Fig. 8 for the range 
$1 \le K \le 200$ with the same color for A or T content. 

Figure 10. PageRank proximity K-K diagram of Homo sapiens HS2 versus Homo 
sapiens HS1 at m = 6 (see text for details). Top panels show the content of A, T (left) 
and A, G (right) in the same way as in Fig. 8 and Fig. 7, respectively. Bottom panels 
show a zoom of the top panels. 

Figure 11. Dependence of the eigenstate amplitude $|\psi_i(K)|$ on PageRank index K on the 
x-axis and eigenvalue index i on the y-axis for the ten largest eigenvalues $|\lambda_i|$, counted 
by i from i = 1 at $|\lambda_1| = 1$ to i = 10 at $|\lambda_{10}| \approx 0.2$. The range 
$1 \le K \le 250$ is shown, with the PageRank vector for a given species at the bottom line of 
each panel. For each species in each panel the color marks the amplitude, changing from blue 
at zero to red at the maximal amplitude value, which is close to unity in each panel. The 
panels show the species: bull BT (top left), dog CF (top right), elephant LA (bottom left), 
Homo sapiens HS (bottom right). 



Table 1. Top ten PageRank entries at DNA word length m = 6 for species: bull BT, 
dog CF, elephant LA, Homo sapiens HS and zebrafish DR. 



BT      CF      LA      HS      DR
[...]   [...]   AAAAAA  [...]   ATATAT
AAAAAA  AAAAAA  [...]   AAAAAA  TATATA
ATTTTT  AATAAA  ATTTTT  ATTTTT  AAAAAA
AAAAAT  TTTATT  AAAAAT  AAAAAT  TTTTTT
TTCTTT  AAATAA  AGAAAA  TATTTT  AATAAA
TTTTAA  TTATTT  TTTTCT  AAAATA  TTTATT
AAAGAA  AAAAAT  AAGAAA  TTTTTA  AAATAA
TTAAAA  ATTTTT  TTTCTT  TAAAAA  TTATTT
TTTTCT  TTTTTA  TTTTTA  TTATTT  CACACA
AGAAAA  TAAAAA  TAAAAA  AAATAA  TGTGTG

Table 2. Ten words with minimal PageRank probability at m = 6 for species: 
bull BT, dog CF, elephant LA, Homo sapiens HS and zebrafish DR. Here the top row is 
the last PageRank entry and the bottom row is the tenth one from the end of the PageRank list. 



BT      CF      LA      HS      DR
CGCGTA  TACGCG  CGCGTA  TACGCG  CCGACG
TACGCG  CGCGTA  TACGCG  CGCGTA  CGTCGG
CGTACG  TCGCGA  ATCGCG  CGTACG  CGTCGA
CGATCG  CGTACG  TCGCGA  TCGACG  TCGACG
ATCGCG  CGATCG  CGCGAT  CGTCGA  TCGTCG
CGCGAT  CGAACG  GTCGCG  CGATCG  CCGTCG
TCGACG  CGTTCG  CGATCG  CGTTCG  CGACGG
CGTCGA  TCGACG  CGCGAC  CGAACG  CGACCG
CGTTCG  CGTCGA  TCGCGC  CGACGA  CGGTCG
TCGTCG  ACGCGA  ACGCGA  CGCGAA  CGACGA

Table 3. Words $W_i$ corresponding to the maximum value of the eigenvector modulus 
$W_i = \max_j(|\psi_i(j)|)$ for species bull BT, dog CF, elephant LA, Homo sapiens HS and 
zebrafish DR, which are shown in dark red in Fig. 11. The eigenvectors at i = 1, ..., 10 
correspond to the ten largest eigenvalues $|\lambda_1|, \ldots, |\lambda_{10}|$ of the DNA Google 
matrix for DNA word length m = 6. The first row i = 1 corresponds to top PageRank entries. 



i   BT      CF      LA      HS      DR
1   [...]   [...]   AAAAAA  [...]   ATATAT
2   [...]   AAAAAA  AAAAAA  [...]   TATATA
3   ACACAC  CTCTCT  AAAAAA  ACACAC  ATATAT
4   ACACAC  AGAGAG  AAAAAA  ACACAC  TAGATA
5   CACACA  CTCTCT  AAAAAA  [...]   ATAGAT
6   CACACA  TCTCTC  AAAAAA  CACACA  TATCTA
7   CCAGGC  AGAGAG  TATGAG  TGGGAG  ATCTAT
8   CCAGGC  AGAGAG  TATGAG  TGGGAG  TAGATA
9   CCCATG  TGTGTG  [...]   CACACA  ATAGAT
10  CCCATG  TGTGTG  AGAGTA  [...]   TATCTA



